Feature/issue 3311 test thread tbb exp #3314
parallel_for, blocked range compiles for stan::math::exp
compiling blocked_range works fine
some progress, now a type deduction issue?
ok something closer...
implement struct version for parallel_for... uncompiled
begin new class to use parallel for
almost compiles... getting close, have template deduction failed which we can figure out
almost compiles
hold on
compiles
remove dead code
compiled parallel_for, blocked_range for stan::math::exp
compiled parallel_for, blocked_range for stan::math::exp
|
Hold on, sorry I should re-base. I have some questions, wondering if anyone had comments or is this all on me? Refactor, and using threads at lower number of observations. |
…rezap/math into feature/issue-3311-test-thread-tbb-exp
|
Do you have a graph that shows the speedup? Overall I'd be kind of cautious introducing lower level threading like this. Like you saw, whether you get a speedup or slowdown depends a lot on the number of observations. So for every vector operation we would have to have a check that the size exceeded some threshold. That threshold is going to vary a lot per computer, and I think if we are not careful it could make the codebase kind of funky. The other piece here is that this works for |
|
I'm thinking about it; I haven't thought too far ahead yet. Thank you.
Best,
Andre Zapico
linkedin.com/in/andre-zapico
gitub.com/drezap
ME Information and Communication Engineering
University of Electronic Science and Technology of China
Consultant, Owner
likely llc
likelyllc.com
Stan Developer
mc-stan.org
BS Mathematical Sciences: Probabilistic Methods
BS Statistics
University of Michigan, Ann Arbor 2017
…On Thu, Apr 30, 2026 at 5:46 PM Steve Bronder ***@***.***> wrote:
*SteveBronder* left a comment (stan-dev/math#3314)
<#3314 (comment)>
Do you have a graph that shows the speedup? Overall I'd be kind of
cautious introducing lower level threading like this. Like you saw, whether
you get a speedup or slowdown depends a lot on the number of observations.
So for every vector operation we would have to have a check that the size
exceeded some threshold. That threshold is going to vary a lot per computer,
and I think if we are not careful it could make the codebase kind of
funky.
The other piece here is that this works for prim functions of double
type, but parallelism is much harder for reverse mode which is the main
piece of the math library we worry about. The main issue is handling how
the global AD tape should sync when we have jobs across N threads.
@andrjohns <https://github.com/andrjohns> thought for a long while trying
to figure out how to do a nice parallel map(...) style function for
reverse mode autodiff. I'm not sure he came up with something he found
satisfying. I have not either honestly. Essentially you need to shard the
operation over N shards which will have N autodiff stacks, then once the
parallel computation is done you have to pass those autodiff stacks back
and put them onto the main thread's stack. So there you would get
performance benefits for setting up the forward pass in parallel, but then
the reverse pass would still be serial and you pay the cost of the sharding
and thread startup. I'm very certain there is a way to do it so you can do
the forward and reverse pass in parallel, but nothing has ever come to me
for this problem.
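The sharding scheme described in this comment can be sketched with a toy tape. This is only an illustration under stated assumptions: `TapeEntry`, `forward_sharded`, and `reverse_serial` are hypothetical names and not Stan Math's API, the tape only understands exp, and std::thread stands in for whatever threading layer would actually be used. It shows the shape of the trade-off: the forward pass records onto N local tapes in parallel, the tapes are merged onto one main tape, and the reverse pass is still serial.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Toy tape entry (hypothetical, not Stan's vari): one exp() node.
struct TapeEntry {
  std::size_t input_index;  // which x[i] this node reads
  double partial;           // d exp(x_i) / d x_i, which equals exp(x_i)
};

// Forward pass: every shard records onto its own local tape in parallel,
// then the local tapes are appended to one main tape on the calling thread.
std::vector<TapeEntry> forward_sharded(const std::vector<double>& x,
                                       std::size_t n_shards) {
  assert(n_shards > 0);
  std::vector<std::vector<TapeEntry>> local_tapes(n_shards);
  std::vector<std::thread> workers;
  const std::size_t chunk = (x.size() + n_shards - 1) / n_shards;
  for (std::size_t s = 0; s < n_shards; ++s) {
    workers.emplace_back([&x, &local_tapes, s, chunk] {
      const std::size_t begin = std::min(s * chunk, x.size());
      const std::size_t end = std::min(begin + chunk, x.size());
      for (std::size_t i = begin; i < end; ++i) {
        local_tapes[s].push_back({i, std::exp(x[i])});
      }
    });
  }
  for (auto& w : workers) w.join();

  // Serial merge: this copy is part of the sharding cost.
  std::vector<TapeEntry> main_tape;
  main_tape.reserve(x.size());
  for (const auto& tape : local_tapes) {
    main_tape.insert(main_tape.end(), tape.begin(), tape.end());
  }
  return main_tape;
}

// Reverse pass for y = sum_i exp(x_i): walks the merged tape serially.
std::vector<double> reverse_serial(const std::vector<TapeEntry>& tape,
                                   std::size_t n_inputs) {
  std::vector<double> grad(n_inputs, 0.0);
  const double output_adjoint = 1.0;  // dy/dy
  for (auto it = tape.rbegin(); it != tape.rend(); ++it) {
    grad[it->input_index] += output_adjoint * it->partial;
  }
  return grad;
}
```

Calling `reverse_serial(forward_sharded(x, 4), x.size())` returns the gradient of the sum of exp(x_i). Only the recording step uses more than one thread, so thread startup and the merge are paid for on every call.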
|
|
I'm running the continuous integration tests; it looks like they're mostly passing now.
I also need to consider threading the rev autodiff stack. It would be cool if different threads could build different expression trees; I think that's what Steve was saying. But if this adds an incremental speed increase, why not? WRT Steve's comment I can think about it, but here I'm not parallelizing anything on the stack, just evaluation of the computation of |
Jenkins Console Log. Machine information: No LSB modules are available. Distributor ID: Ubuntu. Description: Ubuntu 20.04.3 LTS. Release: 20.04. Codename: focal. CPU: G++: Clang: |
|
Not sure why Jenkins emailed me SUCCESS when there are so many errors. I'm not seeing these locally. I also named the branch wrong, but I'll just leave it until it's closed... |
This reverts commit 7ca2f6d.
|
Forwarding made it way slower. I'm burning resources but I want to see if that forwarding caused a huge slowdown. I reverted the last commit. |
|
And what happened across the last three commits (two commits, then a revert to the first of them) that reduced the speed? Is it just terminating runs early because rev/mix/fwd isn't passing? I'm not sure why the speed increased at commit hash e0729e1cdec40e8ec3da60b40b20a2cfc223fc94. |
|
Any comments about these 3 benchmarks? The only thing I changed was adding perfect forwarding (Container &&) in the function input on run 2, and it seemed to slow it down. I have some suppositions, but someone with more expertise may have an idea. I need to read about it. More benchmarks would waste a lot more resources, and I'm not sure of the std. dev. of the runtimes. |
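For readers unfamiliar with the change mentioned here, a minimal sketch of a forwarding-reference parameter follows. It is not the PR's actual signature: `exp_copy_or_move` is a hypothetical helper and it assumes the argument is a `std::vector<double>`. It shows what `Container&&` plus `std::forward` does (copy for lvalues, move for rvalues) and makes no claim about why the benchmark got slower.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// Hypothetical helper, not the function edited in the PR.
// In a template, Container&& is a forwarding reference:
// it binds to both lvalues and rvalues.
template <typename Container>
std::vector<double> exp_copy_or_move(Container&& x) {
  // std::forward preserves the caller's value category:
  // an lvalue argument is copied, an rvalue argument is moved from.
  std::vector<double> out(std::forward<Container>(x));
  for (double& v : out) {
    v = std::exp(v);
  }
  return out;
}
```

With an lvalue argument the caller's vector is left untouched; with a temporary, or an argument wrapped in `std::move`, the buffer is reused rather than copied.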
|
I wouldn't give much credence to the model benchmarks unless you know they're using a function you're editing or they change by a huge amount. I don't even know that they run on the same hardware every time. You'll want to do your own specific benchmarking for a proposal like this. |
|
I'm trying to re-create some of these locally so I'm not wasting resources, but I'm not able to. It looks like I'm using gcc 13.0; am I using the wrong compiler? When I run |
|
Ok looks like gcc 9.0. I left a hanging 1 after an include and gcc still compiled it. |
|
I'm getting Docker/Jenkins set up. It looks like it passed on Jenkins: https://jenkins.flatironinstitute.org/blue/organizations/jenkins/Stan%2FMath/detail/PR-3314/11/pipeline But it's not showing up on GitHub. Am I doing something wrong? |
|
The most recent commit appears to have passed on Jenkins and failed about half of the Github Actions checks, seemingly all the Windows-based ones. |
|
@drezap it would be better to test your new code via Google Benchmark instead of running Jenkins a bunch. I have a repo below with everything set up to run Google Benchmark and Stan with specific branches. |
|
@SteveBronder Thank you, very helpful. And I'm taking a closer look at Jenkins: https://jenkins.flatironinstitute.org/blue/organizations/jenkins/Stan%2FMath/activity?branch=PR-3314, Am I correct to say that push Thank you. |
Looking at the jenkins it seems like all of the commits after 6837a52 passed, though it seems like your current commit is passing jenkins while failing on the other CI. So I'm not sure if those commits passed the other CI either
Do you mean the results from the stan build-bot running the performance regression tests? For a lower level change like this those performance tests are probably too high level for us to be able to reason about the effects. You should set this up as a google benchmark test. That should be nice to run and analyze the results locally. I also want to reference my earlier comment #3314 (comment) . "When will parallelism be worth it" is going to be a pretty hard question to decide at runtime and I'm not confident we want that complexity just for speeding up the |
EDIT: Reading some literature today, this agreed with some of my thoughts about speed when distributing and collecting threads. I was looking at C++ Concurrency in Action: Practical Multithreading (Williams, 2012). But when I removed const and passed by reference and added an lvalue instantiation
Ok, but if concurrency (multithreading) simple stuff adds performance gains, worth adding, add it, if not, I'm not offended. For reverse mode autodiff, can you give me a more formal project spec in an issue? Then I'll look into it. You're seeing if, given a functions_i, f(.), we can send a different thread through each function to build the expression tree in parallel? So suppose we need to compute derivatives for f(.) and g(.), we want to build two expression trees at once using concurrency? I'm trying to specify the problem more clearly. Perhaps I'm not understanding. EDIT: WRT Travis CI: The last updates I'm seeing are from 3 years ago? Am I missing something? |
I would recommend forking the stan-perf repo: https://github.com/SteveBronder/stan-perf
The main issue is, when does the overhead of threading make it worth running in parallel? i.e. if a user has a vector of 100 elements, how many threads should be used for
I'll try to do a writeup this week or next. There was a previous discussion here on doing this which shows a bit of the scope. |
|
I am talking to myself, but my computer was stolen. I plan on continuing this,
so please do not delete this PR. I'm going to revert to the faster version and
then try to use Steve's test suite. But I have no computer.
Cheers everyone, happy developing. :)
…On Mon, May 11, 2026, 2:43 PM Steve Bronder ***@***.***> wrote:
*SteveBronder* left a comment (stan-dev/math#3314)
<#3314 (comment)>
And then the benchmarks were run with these scripts, on this branch:
`test/unit/math/prim/fun/exp_test.cpp`. But I was modifying the typing, and
you can reproduce it via the pushes. If not, I'm down to do a screen share
and I can just show you. Maybe 10-15 minutes.
I would recommend forking the stan-perf repo. That has everything set up
for benchmarking for Stan. When you use Google Benchmark via that repo it is
also easy to export the results of the benchmarks to JSON or CSV. Plus
Google Benchmark has a lot of tools for testing multiple matrix sizes and
the number of threads.
https://github.com/SteveBronder/stan-perf
Ok, but if concurrency (multithreading) simple stuff adds performance
gains, worth adding, add it, if not, I'm not offended.
The main issue is, when does the overhead of threading make it worth
running in parallel? i.e. if a user has a vector of 100 elements, how many
threads should be used for exp(x)? None? Two? Threading has a decently
high overhead cost and for each instantiation of threading you pay for
that. So for small problems the answer is most likely single threaded. When
we are trying to add automatic parallelization we will need to ask at
runtime "how many threads should this operation use given the data size?"
That requires understanding a lot of information about how fast a user's
particular machine can calculate a function and is a pretty hard problem.
Eigen does this, but only for matrix multiplication as they have very good
runtime logic to detect if sharding a large matrix multiply across threads
is worth it.
For reverse mode autodiff, can you give me a more formal project spec in
an issue? Then I'll look into it. You're seeing if, given a functions_i,
f(.), we can send a different thread through each function to build the
expression tree in parallel? So suppose we need to compute derivatives for
f(.) and g(.), we want to build two expression trees at once using
concurrency? I'm trying to specify the problem more clearly. Perhaps I'm
not understanding.
I'll try to do a writeup this week or next. There was a previous
discussion here <#1918> on doing
this which shows a bit of the scope.
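The runtime question raised in this comment ("how many threads should this operation use given the data size?") can be made concrete with a size-gated dispatch. The sketch below is an assumption-laden illustration, not the PR's code: `exp_maybe_parallel` and `kParallelThreshold` are hypothetical names, the cutoff value is arbitrary, and std::thread stands in for `tbb::parallel_for` so the example only needs the standard library.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical cutoff: below this many elements, stay single threaded.
// A sensible value is machine dependent and would have to be measured.
constexpr std::size_t kParallelThreshold = 100000;

// Serial elementwise exp over the half-open index range [begin, end).
inline void exp_range(const std::vector<double>& x, std::vector<double>& out,
                      std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) {
    out[i] = std::exp(x[i]);
  }
}

// Elementwise exp that only spawns threads when the input is large enough.
std::vector<double> exp_maybe_parallel(
    const std::vector<double>& x, std::size_t n_threads,
    std::size_t threshold = kParallelThreshold) {
  assert(n_threads > 0);
  std::vector<double> out(x.size());
  if (x.size() < threshold || n_threads == 1) {
    exp_range(x, out, 0, x.size());  // small input: skip threading overhead
    return out;
  }
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(t * chunk, x.size());
    const std::size_t end = std::min(begin + chunk, x.size());
    if (begin == end) break;  // no work left for further threads
    workers.emplace_back(
        [&x, &out, begin, end] { exp_range(x, out, begin, end); });
  }
  for (auto& w : workers) w.join();
  return out;
}
```

Every call pays for the size check, and above the threshold for thread creation and joining. Choosing the threshold and the thread count per machine is the hard part this comment points at; the sketch simply hard-codes both.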
|
|
Sorry to hear your computer was stolen! Until we have a better idea of what this PR should cover and the benchmarks are clearer, I think we should close the PR. To be clear, that will not delete the branch with your code. All code will still be accessible via the branch issue-3311-test-thread-tbb-exp (https://github.com/drezap/math/tree/feature/issue-3311-test-thread-tbb-exp). |
|
If you'd like to define more clearly what should be covered, via Discourse,
that would be great. Until then, I see no reason why this should be
closed.
What would you like to see in order for a merge to happen?
|
|
I have clearly shown that this increases speed. I'm open to suggestions as
to what might increase execution speed. I couldn't set up your benchmark
report quickly enough.
|
|
I'm reposting my reply from here
The logic for deciding at runtime whether a particular function is worth moving over to the parallel CPU version is going to be a lot of developer and runtime overhead. IMO the maintenance would not be worth it. This has been attempted previously by @andrjohns (and I took a crack at it myself). You can see that whole conversation here. EDIT: Accidentally said gpu |
|
It's just some preprocessor directives, which are essentially if statements that
determine whether a certain area of code will be compiled or not.
I'm only skimming this while I wait on a bootloader for a free Mac I got.
I'm not seeing any direct comparisons between threaded and non-threaded
code, and there seems to be a discrepancy between concurrency and running a
process on different cores.
I'm with Bob:
#1918 (comment)
Instead of chatting, let's come up with a concrete way of determining
whether something is faster.
WRT maintenance, it's about 3 lines of code and some directives. Easy to
maintain.
I seem to have accidentally discovered Amdahl's law. So I propose we come
up with concrete objectives for the benchmarks, and if it's faster we proceed.
Also, typing matters.
I'm not sure how reliable the posteriorDB estimates are, but
locally in Stan/math parallelization with tbb was faster within limits
(#threads matters, etc.). If this runs on exp with many evaluations of
a Gaussian distribution, for example, for thousands of iterations, it could
be worth it. But to play devil's advocate, recollecting threads could
also slow it down.
Again, I'm handicapped by having no computer.
In summary, I don't think the linked thread effectively evaluates
whether this is faster or not. All HPC devs use threading, no? Any ringers
we can bring in?
But WRT maintenance, easy.
|
|
And again, in andrjohns's thread I'm not seeing any direct comparisons
between threaded and unthreaded code, i.e. there's no control and treatment
group. We can't just guess. We need to evaluate it systematically.
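One way to set up such a control/treatment comparison locally is to time both variants with the same harness. The sketch below is a stdlib-only stand-in and not the Google Benchmark setup recommended earlier in the thread; `median_seconds` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `f` `reps` times and returns the median
// wall-clock time of a single run, in seconds.
template <typename F>
double median_seconds(F&& f, int reps) {
  assert(reps > 0);
  std::vector<double> times;
  times.reserve(static_cast<std::size_t>(reps));
  for (int r = 0; r < reps; ++r) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    times.push_back(std::chrono::duration<double>(stop - start).count());
  }
  std::sort(times.begin(), times.end());
  return times[times.size() / 2];  // upper median when reps is even
}
```

The control would be the serial function and the treatment the threaded one, both timed on the same machine over the same range of input sizes. The result of each call needs to be used somewhere, otherwise the compiler may optimize the work away; a dedicated benchmarking library handles that and the statistics more carefully.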
|
|
And I answered these questions with my benchmarks, so it's not a big
mystery: Amdahl's law seems to apply.
|
|
What I am suggesting is that we ignore MCMC for now and just look at the runtime of
evaluating probability distributions. Pretty much all of them use an exponential.
So if there's a slight gain on evaluating computations, then it's totally
worth it to add, no? I think a lot of developers do this under the hood but
don't expose it to users. Do the threads navigate through composite
functions (i.e. the normal distribution)? No idea. But the tests I ran seemed
to improve performance, if we're not considering autodiff, and they passed
tests. I am going for performance, not fancy publications, if that makes
sense. I'm sure devs do this under the hood for game dev etc. The code
I added was only a few lines and some directives.
Best,
Andre Zapico
|
|
Here, I found this informative.
***@***.***/parallel-reduction-in-cuda-bba5e3d124b9
Best,
Andre Zapico
|
|
OK, I am reading through the old threading discussion a bit more
thoroughly. It's cool, but it has many degrees of freedom, and it would be
better to define specifically what we're trying to thread. Something as
simple as concurrency in an operation that requires a lot of FLOPs could
potentially add some speed, and we could isolate autodiff later. The
conversation goes in a bunch of different directions and isn't concrete
about what we're trying to do. Systematically threading simple functions
and the evaluation of PDF values might be a starting point; if that adds
speed, good. Threading autodiff is a different problem. But starting simple
on an iterative algorithm could add cumulative gains.
See what I'm saying? That way there's a concrete gain as opposed to a
convoluted research question.
So: thread this, benchmark on all PDFs, and then continue. Just evaluation,
not gradients; then we could work on autodiff more.
I'm not sure what proportion of the runtime of Stan's HMC is purely
evaluation, but I think it's non-negligible and could be sped up.
And then focus on autodiff after. Sound stupid?
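A control-versus-treatment comparison for evaluation only could look roughly like the sketch below. It is an illustration under stated assumptions, not the benchmark used for this PR: std::thread stands in for tbb::parallel_for, and median_us, BenchRow, and run_bench are hypothetical names. It times the serial and threaded exp loops on identical inputs at each size and reports the median of several repetitions.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Control: plain serial loop writing exp(x[i]) into out[i].
inline void exp_serial(const std::vector<double>& x, std::vector<double>& out) {
  for (std::size_t i = 0; i < x.size(); ++i) out[i] = std::exp(x[i]);
}

// Treatment: the same loop split into contiguous chunks across threads.
inline void exp_threaded(const std::vector<double>& x, std::vector<double>& out,
                         unsigned n_threads) {
  assert(n_threads >= 1 && out.size() == x.size());
  const std::size_t n = x.size();
  const std::size_t chunk = (n + n_threads - 1) / n_threads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(n, t * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    if (begin == end) break;
    workers.emplace_back([&x, &out, begin, end] {
      for (std::size_t i = begin; i < end; ++i) out[i] = std::exp(x[i]);
    });
  }
  for (auto& w : workers) w.join();
}

// Median wall-clock time, in microseconds, over reps calls to f.
template <typename F>
double median_us(F&& f, int reps) {
  assert(reps > 0);
  std::vector<double> t(reps);
  for (int r = 0; r < reps; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    t[r] = std::chrono::duration<double, std::micro>(t1 - t0).count();
  }
  std::nth_element(t.begin(), t.begin() + reps / 2, t.end());
  return t[reps / 2];
}

// One row per input size: both timings plus a checksum of the output,
// which also keeps the computed values observably used.
struct BenchRow {
  std::size_t n;
  double serial_us;
  double threaded_us;
  double checksum;
};

inline std::vector<BenchRow> run_bench(const std::vector<std::size_t>& sizes,
                                       unsigned n_threads, int reps) {
  std::vector<BenchRow> rows;
  for (std::size_t n : sizes) {
    std::vector<double> x(n, 0.5), out(n);
    const double s = median_us([&] { exp_serial(x, out); }, reps);
    const double p = median_us([&] { exp_threaded(x, out, n_threads); }, reps);
    const double sum = std::accumulate(out.begin(), out.end(), 0.0);
    rows.push_back({n, s, p, sum});
  }
  return rows;
}
```

Calling something like run_bench({1000, 32000, 1000000}, 4, 25) and comparing serial_us with threaded_us row by row gives the direct threaded-versus-unthreaded comparison per size. The same harness could wrap a PDF evaluation instead of exp, still without involving gradients.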
Best,
Andre Zapico
|
Summary
I wrote a class that contains an operator for exp, which allows us to use tbb to parallelize a for loop. At lower numbers of observations the gain from parallelization is marginal, but at higher numbers of observations parallelizing the for loop with tbb::parallel_for pays off: for example, at about 32,000 observations there seems to be a speedup at 4 threads that is sustained as we increase the size of the Container.
Tests
I tested for numerical accuracy, which checks out. Moreover, I did the following performance tests:
Side Effects
Yes. If we kick in threads too early, there's actually a slowdown in computing exp on a vector with a lower number of observations. It may be good to have a default minimum number of threads, or to have threads kick in only when the dataset is a certain size. Moreover, this is just one function, so the result may be different for a composite function (Gaussian). I think threading may be advantageous at lower numbers of observations in that case, but I have not evaluated this.
What I've done is add a directive that runs the multithreaded code only for vector, and otherwise calls the original code (copy-pasted into the STAN_THREADS section) when the function is not threaded for exp. I'd be open to a quick refactor if we wanted to set it up like OpenCL and have a threads directory under stan\math\prim.
Release notes
?
Checklist
Copyright holder: (Andre Zapico, Likely LLC, 2026)
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- ./runTests.py test/unit
- make test-headers
- make test-math-dependencies
- make doxygen
- make cpplint
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested